[tx] Implement cutlass kernel for ragged_dot with group_offset by pcmoritz · Pull Request #896 · NovaSky-AI/SkyRL

pcmoritz · 2026-01-19T07:01:43Z

This brings down the step time of

uv run --with wandb --with tinker==0.3.0 sl_loop.py     base_url=http://localhost:8000     model_name=Qwen/Qwen3-30B-A3B lora_rank=1 max_length=512

with

uv run --extra gpu --extra tinker -m tx.tinker.api     --base-model Qwen/Qwen3-30B-A3B     --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 8, "expert_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

from 40s to 20s. I spent some time tuning the tile sizes and also tried different tile sizes and configurations for different settings (e.g. the different projections, or the low-k setting for LoRA), but it only made a very small difference and isn't worth the complexity for now.
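For readers unfamiliar with the operation named in the title, here is a minimal pure-Python reference for what a ragged dot with a group offset might compute. This is a sketch, not the kernel in this PR. The function name `ragged_dot_reference` and the exact `group_offset` semantics are assumptions: the local shard is taken to hold weights only for groups `group_offset` through `group_offset + g_local - 1`, and rows belonging to any other group produce zeros.

```python
def ragged_dot_reference(lhs, rhs, group_sizes, group_offset=0):
    """Slow reference for a grouped (ragged) matmul with a group offset.

    lhs:          m rows of length k, sorted so rows of the same group are contiguous
    rhs:          g_local weight matrices, each k x n (the groups held locally)
    group_sizes:  number of lhs rows in each *global* group
    group_offset: global index of the first locally held group (assumed semantics)
    """
    g_local = len(rhs)
    n = len(rhs[0][0])
    out = []
    row = 0
    for group, size in enumerate(group_sizes):
        local = group - group_offset  # index into the local weight shard
        for _ in range(size):
            if 0 <= local < g_local:
                w = rhs[local]
                # Plain row-vector times matrix product for this row.
                out.append([sum(x * w[i][j] for i, x in enumerate(lhs[row]))
                            for j in range(n)])
            else:
                # Group is not held locally: assumed to contribute zeros.
                out.append([0.0] * n)
            row += 1
    return out


lhs = [[1, 2], [3, 4], [5, 6], [7, 8]]
rhs = [
    [[1, 0], [0, 1]],  # weights for global group 1 (identity)
    [[2, 0], [0, 2]],  # weights for global group 2 (scale by 2)
]
result = ragged_dot_reference(lhs, rhs, group_sizes=[1, 2, 1], group_offset=1)
```

With expert parallelism, each device would call something like this with its own shard of expert weights and its own offset, and the per-device outputs would be summed. That combination step is likewise an assumption and is not shown here.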

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 8 additional findings.

pcmoritz added 30 commits January 17, 2026 18:04

add cutlass ragged dot

12cc124

update

70dce5f

update

d70e010

update

ac85d10

update

106e4ae

add backward

94eb625

update

cf80c97

update

7f1fe1d

use grouped gemm

1d9df09

update

527c1a0

update

656756f

update

aee36c7

Merge branch 'tx-ragged-dot-cutlass' of github.com:pcmoritz/SkyRL int…

cfb3404

…o tx-ragged-dot-cutlass

update

6e9ead9

update

63914e7

update

f92d00d

fix

ad0bfee

update

70cba86

optimize

3f4dd25

fixes

3f6669d

optimize

7b22f86

try to use clusters

accff8e

update schedule

f1fb36c

try tile size

b1c48f4

update

046a033

optimize

5b14a8a

optimize

23a74e5

simplify

4c86409

simplify

2dcce20

add lto

e731efe

pcmoritz added 22 commits January 19, 2026 16:59

fix

fc8c75b

update

b0e14f3

update

79b0d44

add tests for lora

1a7485f

update

8fa5c2e

update

f6a6a92

update

2db8415

update

ae19ae9

update tiles

988cc06

update

0167744

update

3dfcc14

update

50357b4

update

58956a4

update

447305d

update

c1f1ab1

update

fa814f1

update

f72b285

update

2ef0d7e

update

59be86f

update

ce1eba9

Merge branch 'main' into tx-ragged-dot-cutlass

5d4950d

Merge branch 'main' into tx-ragged-dot-cutlass

0befb74

devin-ai-integration bot reviewed Feb 17, 2026

pcmoritz added 7 commits February 23, 2026 10:28

Merge branch 'main' into tx-ragged-dot-cutlass

2c4619e

fix

ca50437

update

00d8a16

update

419e3e9

add decoding

678dfd8

use cuda graphs

fa08c4b

save work

22e1d07

pcmoritz wants to merge 116 commits into NovaSky-AI:main from pcmoritz:tx-ragged-dot-cutlass

pcmoritz commented Jan 19, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pcmoritz commented Jan 19, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pcmoritz commented Jan 19, 2026 •

edited by devin-ai-integration bot

Loading